The dataset comes from the University of California, Irvine (UCI). The same dataset was used in the paper "Using machine learning techniques to generate laboratory diagnostic pathways—a case study" by Hoffmann et al. (2018), published in the Journal of Laboratory and Precision Medicine.

The dataset concerns patients with liver disease related to hepatitis C. It contains a group of healthy subjects (blood donors) and a group of patients with liver disease. It has 14 columns: 4 describe the patient (ID, diagnosis category, age and sex) and 10 hold the values of laboratory tests. There are records for 615 patients.

In [121]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

Reading the data

We start by reading the data into a pandas DataFrame, which gives a convenient overview.

In [122]:
hepatitis_c_csv = "HepatitisCdata.csv"

hepatitis_data = pd.read_csv(hepatitis_c_csv, sep=',', header=0)
hepatitis_data.head()
Out[122]:
Unnamed: 0 Category Age Sex ALB ALP ALT AST BIL CHE CHOL CREA GGT PROT
0 1 0=Blood Donor 32 m 38.5 52.5 7.7 22.1 7.5 6.93 3.23 106.0 12.1 69.0
1 2 0=Blood Donor 32 m 38.5 70.3 18.0 24.7 3.9 11.17 4.80 74.0 15.6 76.5
2 3 0=Blood Donor 32 m 46.9 74.7 36.2 52.6 6.1 8.84 5.20 86.0 33.2 79.3
3 4 0=Blood Donor 32 m 43.2 52.0 30.6 22.6 18.9 7.33 4.74 80.0 33.8 75.7
4 5 0=Blood Donor 32 m 39.2 74.1 32.6 24.8 9.6 9.15 4.32 76.0 29.9 68.7

Next, we build a codebook from the dataset so that it is clear which units are used and what the test abbreviations mean. The dataset consists only of patient information and laboratory tests. We keep all of them so that we can later examine which values may be correlated, and from there which blood values say the most about a hepatitis C infection.

In [123]:
codebook = {
    "attribute": ["ID", "Category", "Age", "Sex", "ALB", "ALP", "ALT", "AST", "BIL", "CHE", "CHOL", "CREA", "GGT", "PROT"],
    "unit": ["a.u.", "n.a.", "years", "a.u.", "a.u.", "a.u.", "a.u.", "a.u.", "a.u.", "a.u.", "a.u.", "a.u.", "a.u.", "a.u."],
    "dtype": ["integer", "category", "integer", "category", "float", "float", "float", "float", "float", "float", "float", "float", "float", "float",],
    "description": [
        "Patient ID",
        "Diagnosis (0=Blood Donor, 0s=suspect Blood Donor, 1=Hepatitis, 2=Fibrosis, 3=Cirrhosis)",
        "Age",
        "Sex (M/F)",
        "Albumin",
        "Alkaline phosphatase",
        "Alanine aminotransferase",
        "Aspartate aminotransferase",
        "Bilirubin",
        "Cholinesterase",
        "Cholesterol",
        "Creatinine",
        "Gamma-glutamyltransferase",
        "Protein"
    ]
}

pd.DataFrame(codebook).set_index("attribute")
Out[123]:
unit dtype description
attribute
ID a.u. integer Patient ID
Category n.a. category Diagnosis (0=Blood Donor, 0s=suspect Blood Don...
Age years integer Age
Sex a.u. category Sex (M/F)
ALB a.u. float Albumin
ALP a.u. float Alkaline phosphatase
ALT a.u. float Alanine aminotransferase
AST a.u. float Aspartate aminotransferase
BIL a.u. float Bilirubin
CHE a.u. float Cholinesterase
CHOL a.u. float Cholesterol
CREA a.u. float Creatinine
GGT a.u. float Gamma-glutamyltransferase
PROT a.u. float Protein

We know how many rows and columns the dataset should have, so we first check that the data has been loaded completely.

In [124]:
hepatitis_data.shape
Out[124]:
(615, 14)

The Category column has 5 values: 0=Blood Donor, 0s=suspect Blood Donor, 1=Hepatitis, 2=Fibrosis and 3=Cirrhosis. The group 0s=suspect Blood Donor is small, with 7 instances. Neither Kaggle nor the cited paper explains what sets this group apart. For that reason these instances are removed from the dataset. Adding them to 0=Blood Donor would only enlarge the negative class, and adding them to 1, 2 or 3 could cause the machine learning algorithm to produce incorrect predictions later on.

In [125]:
hepatitis_data = hepatitis_data[hepatitis_data['Category'] != '0s=suspect Blood Donor']

if any(hepatitis_data['Category'] == '0s=suspect Blood Donor'):
    print("Instances with category '0s=suspect Blood Donor' are still in the DataFrame.")
else:
    print("Instances with category '0s=suspect Blood Donor' are no longer in the DataFrame.")
Instances with category '0s=suspect Blood Donor' are no longer in the DataFrame.
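
A quick way to see how many rows each diagnosis group holds, before and after filtering, is `value_counts()`. The sketch below uses a small made-up frame (`toy`), not the real data; only the column name `Category` and the label strings are taken from the notebook.

```python
import pandas as pd

# Hypothetical miniature stand-in for hepatitis_data (values are made up).
toy = pd.DataFrame({
    "Category": [
        "0=Blood Donor", "0=Blood Donor", "0s=suspect Blood Donor",
        "1=Hepatitis", "2=Fibrosis", "3=Cirrhosis",
    ],
    "Age": [32, 45, 51, 38, 49, 60],
})

# Count rows per diagnosis group before filtering.
counts_before = toy["Category"].value_counts()

# Drop the suspect donors, mirroring the filter used in the notebook.
filtered = toy[toy["Category"] != "0s=suspect Blood Donor"]
counts_after = filtered["Category"].value_counts()

print(counts_before)
print(counts_after)
```

Comparing the two counts makes the size of the removed group explicit instead of only confirming that it is gone.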

Conclusion: the data has been loaded completely and is ready for the Exploratory Data Analysis.

Exploratory Data Analysis (univariate)

In [126]:
pd.DataFrame({"is na": hepatitis_data.isna().sum()}).T
Out[126]:
Unnamed: 0 Category Age Sex ALB ALP ALT AST BIL CHE CHOL CREA GGT PROT
is na 0 0 0 0 1 18 1 0 0 0 10 0 0 1

This table shows the number of missing values per column. Only 5 columns have missing data: ALB, ALP, ALT, CHOL and PROT. ALB, ALT and PROT each miss a single value, CHOL misses 10 and ALP misses 18. These numbers are small relative to the 608 rows that remain after removing the suspect donors.
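
To judge whether the missing counts are small, it can help to express them as a percentage of the rows. The following is a minimal sketch on a made-up frame (`toy`); the column names follow the notebook, the values do not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "ALB": [38.5, np.nan, 46.9, 43.2],
    "ALP": [52.5, 70.3, np.nan, np.nan],
    "AST": [22.1, 24.7, 52.6, 22.6],
})

# isna().mean() gives the fraction of missing values per column.
missing_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
})
print(missing_summary)
```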

In [108]:
hepatitis_data.describe()
Out[108]:
Unnamed: 0 Age ALB ALP ALT AST BIL CHE CHOL CREA GGT PROT
count 608.000000 608.000000 607.000000 590.000000 607.000000 608.000000 608.000000 608.000000 598.000000 608.000000 608.000000 607.000000
mean 305.363487 47.291118 41.818781 67.821017 27.601318 34.369408 11.474013 8.204885 5.378829 81.513158 38.243914 72.253213
std 176.981084 9.992705 5.406717 25.274423 21.227539 32.622442 19.770558 2.168400 1.119394 49.720652 51.953220 4.932252
min 1.000000 19.000000 20.000000 11.300000 0.900000 12.000000 1.800000 1.420000 1.430000 8.000000 4.500000 51.000000
25% 152.750000 39.000000 39.000000 52.500000 16.400000 21.600000 5.300000 6.950000 4.620000 68.000000 15.700000 69.450000
50% 304.500000 47.000000 42.000000 66.000000 23.000000 25.850000 7.300000 8.270000 5.300000 77.000000 23.250000 72.200000
75% 456.250000 54.000000 45.250000 79.525000 32.750000 32.800000 11.300000 9.585000 6.075000 88.000000 39.200000 75.400000
max 615.000000 77.000000 82.200000 416.600000 258.000000 324.000000 254.000000 16.410000 9.670000 1079.100000 650.900000 90.000000

For each numeric column this table shows the count, mean, standard deviation, minimum, quartiles and maximum. The column 'Unnamed: 0' is not relevant for the EDA, so it is removed.

In [127]:
hepatitis_data.drop(hepatitis_data.columns[0], axis=1, inplace=True)
hepatitis_data
Out[127]:
Category Age Sex ALB ALP ALT AST BIL CHE CHOL CREA GGT PROT
0 0=Blood Donor 32 m 38.5 52.5 7.7 22.1 7.5 6.93 3.23 106.0 12.1 69.0
1 0=Blood Donor 32 m 38.5 70.3 18.0 24.7 3.9 11.17 4.80 74.0 15.6 76.5
2 0=Blood Donor 32 m 46.9 74.7 36.2 52.6 6.1 8.84 5.20 86.0 33.2 79.3
3 0=Blood Donor 32 m 43.2 52.0 30.6 22.6 18.9 7.33 4.74 80.0 33.8 75.7
4 0=Blood Donor 32 m 39.2 74.1 32.6 24.8 9.6 9.15 4.32 76.0 29.9 68.7
... ... ... ... ... ... ... ... ... ... ... ... ... ...
610 3=Cirrhosis 62 f 32.0 416.6 5.9 110.3 50.0 5.57 6.30 55.7 650.9 68.5
611 3=Cirrhosis 64 f 24.0 102.8 2.9 44.4 20.0 1.54 3.02 63.0 35.9 71.3
612 3=Cirrhosis 64 f 29.0 87.3 3.5 99.0 48.0 1.66 3.63 66.7 64.2 82.0
613 3=Cirrhosis 46 f 33.0 NaN 39.0 62.0 20.0 3.56 4.20 52.0 50.0 71.0
614 3=Cirrhosis 59 f 36.0 NaN 100.0 80.0 12.0 9.07 5.30 67.0 34.0 68.0

608 rows × 13 columns

There are now 13 columns instead of 14. With the patient IDs removed from the dataset, we can continue.

In [128]:
hepatitis_data.hist(bins=20, layout=(3, 4), figsize=(16.0, 6.4));
[Figure: histograms of the numeric attributes]

Age, ALB, CHE, CHOL and PROT look roughly normally distributed; the remaining attributes are far more skewed. This may be caused by outliers.

In [129]:
axs = hepatitis_data.boxplot(grid=False, vert=False, figsize=(12.0, 6.0))
axs.set_title("Boxplot of the distributions");
[Figure: horizontal boxplots of the numeric attributes]

The boxplot shows the distribution of the values. Several tests (CREA, for example) have high values, but clinically these may well fit the patient in question and need not be outliers. For this reason these values are kept in the rest of the EDA.

To try to correct the skewed distributions, log transformations are applied. To get a visual impression, histograms are shown again for each attribute.
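
The points drawn beyond the whiskers of a default boxplot follow the 1.5 × IQR rule. As a sketch, the helper below (`iqr_outlier_mask`, a hypothetical name) flags such values without removing them, so they can be inspected before deciding anything. The creatinine-like numbers are made up.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, factor: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - factor * iqr) | (values > q3 + factor * iqr)

# Made-up creatinine-like values with one extreme reading.
crea = pd.Series([68.0, 74.0, 77.0, 80.0, 88.0, 1079.1], name="CREA")
mask = iqr_outlier_mask(crea)

# Show the flagged values; nothing is dropped from the series.
print(crea[mask])
```

Flagging rather than filtering matches the choice made here: a statistically extreme value can still be clinically plausible.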

In [130]:
hepatitis_log = np.log10(hepatitis_data.select_dtypes('number'))

hepatitis_log.hist(bins=20, layout=(3, 4), figsize=(16.0, 6.4));
[Figure: histograms of the log10-transformed attributes]

After the log transformation the distributions look closer to normal. We do the same for the boxplot.

In [131]:
axs = hepatitis_log.boxplot(grid=False, vert=False, figsize=(12.0, 6.0))
axs.set_title("Boxplot of the distributions after log transformation");
[Figure: horizontal boxplots of the log10-transformed attributes]

Here too the data is closer to normally distributed after the log transformation. We therefore continue with the log-transformed values for some attributes. Age, ALB, CHE, CHOL and PROT are already roughly normally distributed and need no correction. ALP, ALT, AST, BIL, CREA and GGT are closer to normal after a log transformation, so these are transformed.
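
The visual impression can be backed by a number: sample skewness is close to 0 for a symmetric distribution and clearly positive for a long right tail. The sketch below uses synthetic log-normal values as a stand-in for a skewed lab value; it does not use the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values (log-normal), standing in for an enzyme such as GGT.
raw = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=2000))
logged = np.log10(raw)

# Compare skewness before and after the log10 transform.
skew_raw = raw.skew()
skew_log = logged.skew()
print(f"skewness raw: {skew_raw:.2f}, after log10: {skew_log:.2f}")
```

Applied per column, the same comparison would give a simple criterion for which attributes to transform.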

In [132]:
for attribute in ("ALP", "ALT", "AST", "BIL", "CREA", "GGT"):
    if attribute in codebook["attribute"]:
        newname = "log(" + attribute + ")"
        index = codebook["attribute"].index(attribute)
        codebook["attribute"][index] = newname
        codebook["description"][index] = "Log10-transform of " + codebook["description"][index]
        hepatitis_data.rename(columns={attribute: newname}, inplace=True)
        hepatitis_data[newname] = hepatitis_log[attribute]

pd.DataFrame(codebook).set_index("attribute")
Out[132]:
unit dtype description
attribute
ID a.u. integer Patient ID
Category n.a. category Diagnosis (0=Blood Donor, 0s=suspect Blood Don...
Age years integer Age
Sex a.u. category Sex (M/F)
ALB a.u. float Albumin
log(ALP) a.u. float Log10-transform of Alkaline phosphatase
log(ALT) a.u. float Log10-transform of Alanine aminotransferase
log(AST) a.u. float Log10-transform of Aspartate aminotransferase
log(BIL) a.u. float Log10-transform of Bilirubin
CHE a.u. float Cholinesterase
CHOL a.u. float Cholesterol
log(CREA) a.u. float Log10-transform of Creatinine
log(GGT) a.u. float Log10-transform of Gamma-glutamyltransferase
PROT a.u. float Protein

The codebook above shows that the attributes that needed it have been replaced by their log-transformed values.

In [133]:
axs = sns.histplot(hepatitis_data, x="Category", hue="Sex", multiple="stack", shrink=0.8)
axs.set_title("Distribution of diagnoses by sex");
plt.xticks(rotation=45)

plt.show()
[Figure: stacked bar chart of diagnosis category by sex]

The bar chart of diagnoses by sex shows that the dataset is imbalanced. The healthy group (0=Blood Donor) contains far more instances than the affected groups (1=Hepatitis, 2=Fibrosis and 3=Cirrhosis).

Exploratory Data Analysis (bivariate)

In [16]:
sns.pairplot(hepatitis_data, hue="Sex");
[Figure: pairplot of all numeric attributes, coloured by sex]

The pairplot above gives some information. log(CREA) is the only attribute whose panels look different: the points lie along a more horizontal or vertical band. ALB and PROT appear to be weakly related, which makes sense because albumin is the most abundant protein in blood. log(AST) and log(GGT) also appear to be related. The remaining attributes show no clear correlation.

In [134]:
hepatitis_without_category = hepatitis_data.drop(columns=['Category', 'Sex'])

axs = sns.heatmap(hepatitis_without_category.corr(), annot=True, cmap="coolwarm", vmin=-1.0, vmax=1.0, square=True)
axs.set_title("Pairwise correlations ($R$)");
[Figure: heatmap of pairwise correlations between the numeric attributes]

This heatmap shows clearly which attributes correlate with each other and which do not. Age correlates weakly with log(GGT), CHOL and log(ALP). As the pairplot already suggested, log(GGT) and log(AST) are indeed correlated. PROT, CHE and log(ALT) all correlate slightly with one another.

Conclusion EDA
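
Reading the strongest pairs off a heatmap by eye is error-prone; they can also be listed directly from the correlation matrix. The sketch below runs on synthetic data in which PROT is constructed to depend on ALB. The column names are borrowed from the notebook, the values are not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Synthetic data: PROT is built to depend on ALB, AGE is independent noise.
alb = rng.normal(42, 5, n)
toy = pd.DataFrame({
    "ALB": alb,
    "PROT": 30 + alb + rng.normal(0, 4, n),
    "AGE": rng.normal(47, 10, n),
})

corr = toy.corr()  # Pearson correlation by default

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```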

To close the Exploratory Data Analysis, we give a short conclusion. The plots show that the data may contain outliers, for example a creatinine (CREA) value above 1000. However, such a value can be clinically correct, for example because of an underlying condition. Because that information is not available, these outliers are not filtered out. Because it is not clear what the group '0s=suspect Blood Donor' means or how it differs from the other groups, and because it contains few instances, we decided to remove these instances and thereby the whole group. Merging it with another group could lead to incorrect predictions later on, so we chose not to do that. Finally, there appears to be little correlation between the attributes in the dataset. PROT and ALB do correlate, but this is probably because albumin is the most abundant protein in the bloodstream. Other attributes correlate weakly or not at all.

The data as it stands at the end of the EDA is good enough to take into the next step: Machine Learning.

Machine Learning

Because the classes are imbalanced (group 0 is much larger than groups 1, 2 and 3), we use SMOTE to create extra instances. For this, the rows with NaN values must be removed first.

In [135]:
hepatitis_data = hepatitis_data.dropna(axis='rows')

hepatitis_data.shape
Out[135]:
(582, 13)

We select the age, sex and laboratory attributes as features X and the diagnosis as target y, and convert both to NumPy arrays. Note that the slice `1:-1` stops before the last column (PROT), so X has 11 features. Sex is recoded from "f" and "m" to 0.0 and 1.0 respectively, to avoid nominal values. We also cast X to float, because SMOTE works on numeric input.

In [136]:
pd.set_option('future.no_silent_downcasting', True)

X = hepatitis_data.iloc[:, 1:-1].replace({"f": 0.0, "m": 1.0}).to_numpy()
y = hepatitis_data.iloc[:, 0].to_numpy()

X = X.astype(float)

X.shape, X.dtype, y.shape, y.dtype
Out[136]:
((582, 11), dtype('float64'), (582,), dtype('O'))
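
Selecting features by name instead of by position makes it explicit which columns end up in X. The sketch below contrasts the two on a made-up frame (`toy`) with a shortened column list. Whether PROT belongs in the feature set is a modelling decision, and the results in this notebook are based on the 11 positional columns.

```python
import pandas as pd

# Hypothetical frame with a column order like the notebook's (values made up).
toy = pd.DataFrame({
    "Category": ["0=Blood Donor", "1=Hepatitis", "3=Cirrhosis"],
    "Age": [32, 45, 60],
    "Sex": ["m", "f", "f"],
    "ALB": [38.5, 41.0, 29.0],
    "PROT": [69.0, 72.3, 82.0],
})

# Positional slice: 1:-1 stops before the last column, so PROT is left out.
positional = toy.iloc[:, 1:-1]

# Selecting by name keeps the feature set independent of column order.
features = toy.drop(columns=["Category"]).assign(
    Sex=lambda df: df["Sex"].map({"f": 0.0, "m": 1.0})
)
X_toy = features.to_numpy(dtype=float)
y_toy = toy["Category"].to_numpy()

print(list(positional.columns))  # PROT is absent here
print(list(features.columns))    # PROT is present here
```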

Before looking at which model fits best, we first make the class distribution more even using SMOTE. Because the test data should not be altered, the data is first split into training and test data.

In [137]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
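
With small minority classes, a purely random split can leave very few minority rows in one of the partitions. As a sketch of one option (not what the cell above does), `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. The data below is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic imbalanced labels: 90 majority rows and 10 minority rows.
y_toy = np.array(["0=Blood Donor"] * 90 + ["3=Cirrhosis"] * 10)
X_toy = rng.normal(size=(100, 3))

# stratify=y_toy keeps the 90/10 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

n_minority_test = int((y_te == "3=Cirrhosis").sum())
print(n_minority_test)
```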

Now that the data is split into training and test data, we apply SMOTE.

In [138]:
from imblearn.over_sampling import SMOTE

sample_ratios = {
    "0=Blood Donor": 500,
    "1=Hepatitis": 500,
    "2=Fibrosis": 500,
    "3=Cirrhosis": 500
}

smote = SMOTE(sampling_strategy=sample_ratios, k_neighbors=5, random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
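
For intuition, the core idea of SMOTE is to place a synthetic row somewhere on the line segment between a minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates only that interpolation step. It is not imbalanced-learn's implementation, and `smote_like_samples` is a hypothetical helper.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))   # random minority row
        other = rng.choice(neighbours[base])   # one of its k nearest neighbours
        gap = rng.random()                     # position on the connecting segment
        synthetic[i] = X_minority[base] + gap * (X_minority[other] - X_minority[base])
    return synthetic

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_like_samples(X_min, n_new=10, k=2)
print(new_rows.shape)
```

Because every synthetic row lies between two existing rows, each class needs more than `k` rows for the neighbour search to work, and the new rows never leave the region already covered by that class.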

A dict was passed to sampling_strategy so that we can choose the number of instances for each of the 4 categories ourselves. Using .shape we check whether there are indeed more instances.

In [139]:
X_train_smote.shape, y_train_smote.shape
Out[139]:
((2000, 11), (2000,))

As .shape shows, the dataset has indeed been extended: there are now 500 instances per category. From here we can look for a machine learning algorithm that suits the dataset. We pick 10 classifiers (including a DummyClassifier as baseline) and evaluate each with cross_validate to see how well it fits this dataset.

In [158]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier

models = [
    DecisionTreeClassifier,
    DummyClassifier,
    GaussianNB,
    KNeighborsClassifier,
    RandomForestClassifier,
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
    SVC,
    LogisticRegression,
    AdaBoostClassifier
]
In [159]:
from sklearn.model_selection import cross_validate

metric_scores = {}
for model in models:
    scores = cross_validate(model(), X_train_smote, y_train_smote, return_train_score=True)
    for key, val in scores.items():
        scores[key] = val.mean()
    metric_scores[f"{model.__name__}"] = scores
    
pd.DataFrame(metric_scores).T
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\discriminant_analysis.py:935: UserWarning: Variables are collinear
  warnings.warn("Variables are collinear")
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
Out[159]:
fit_time score_time test_score train_score
DecisionTreeClassifier 0.017285 0.000805 0.9785 1.000000
DummyClassifier 0.000397 0.000599 0.2500 0.250000
GaussianNB 0.002633 0.000617 0.9580 0.961500
KNeighborsClassifier 0.003051 0.023944 0.9095 0.937375
RandomForestClassifier 0.564807 0.008219 0.9985 1.000000
LinearDiscriminantAnalysis 0.005807 0.001002 0.9520 0.952875
QuadraticDiscriminantAnalysis 0.002763 0.001300 1.0000 1.000000
SVC 0.112474 0.104457 0.6985 0.702500
LogisticRegression 0.128176 0.001809 0.9665 0.965625
AdaBoostClassifier 0.406707 0.014208 0.4055 0.397125

cross_validate reports, among other things, the test and train scores for each model. On this dataset RandomForestClassifier, DecisionTreeClassifier, KNeighborsClassifier, LinearDiscriminantAnalysis, LogisticRegression and GaussianNB score very well. RandomForestClassifier and DecisionTreeClassifier both score 1.000000 on the training data, and QuadraticDiscriminantAnalysis even scores 1.000000 on both train and test, which looks too good to be true, for example because of overfitting. We therefore pass over these three models. fit_time and score_time are not taken into account when deciding whether an algorithm goes on to the next step.

LinearDiscriminantAnalysis and SVC were included as a test; beforehand we expected these types of model not to suit this dataset well. For SVC this holds, although scores of around 70% are not very bad. For LinearDiscriminantAnalysis it does not: this model scores much higher than expected.

We continue with LinearDiscriminantAnalysis, LogisticRegression and GaussianNB.
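
One possible contributor to very high cross-validation scores is that oversampling was done once, before the folds were made. Synthetic rows in a validation fold can then be interpolations of rows in the training folds. The sketch below oversamples inside each fold instead, on synthetic data. Plain random oversampling via `sklearn.utils.resample` is swapped in for SMOTE to keep the example dependency-free; with imbalanced-learn, `imblearn.pipeline.Pipeline` is intended for the same purpose.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Synthetic imbalanced two-class data (180 vs 20 rows), purely illustrative.
X_toy = np.vstack([rng.normal(0.0, 1.0, (180, 4)), rng.normal(1.5, 1.0, (20, 4))])
y_toy = np.array([0] * 180 + [1] * 20)

fold_scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in cv.split(X_toy, y_toy):
    X_tr, y_tr = X_toy[train_idx], y_toy[train_idx]

    # Oversample the minority class using training rows only.
    minority = X_tr[y_tr == 1]
    n_majority = int((y_tr == 0).sum())
    minority_up = resample(minority, replace=True, n_samples=n_majority, random_state=42)
    X_bal = np.vstack([X_tr[y_tr == 0], minority_up])
    y_bal = np.array([0] * n_majority + [1] * n_majority)

    # The validation fold keeps its original, imbalanced composition.
    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    y_pred = model.predict(X_toy[val_idx])
    fold_scores.append(balanced_accuracy_score(y_toy[val_idx], y_pred))

print(np.round(fold_scores, 3))
```

Balanced accuracy is used here because the validation folds stay imbalanced, where plain accuracy would be dominated by the majority class.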
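
The ConvergenceWarning printed for LogisticRegression suggests scaling the data, and an SVC with its default RBF kernel is also sensitive to feature scale. As a sketch of that suggestion, the models can be wrapped in a pipeline with `StandardScaler`. The data below is synthetic, and `max_iter=1000` is an arbitrary illustrative value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic two-class data whose features live on very different scales.
n = 200
y_toy = np.repeat([0, 1], n // 2)
X_toy = np.column_stack([
    rng.normal(0, 1, n) + 2 * y_toy,  # informative, small scale
    rng.normal(0, 1000, n),           # uninformative, large scale
])

results = {}
for name, estimator in {
    "SVC": SVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}.items():
    # The scaler is fit inside each training fold, then applied to the validation fold.
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_validate(pipe, X_toy, y_toy, cv=5)
    results[name] = scores["test_score"].mean()

print(results)
```

Whether scaling changes the ranking of the models on the hepatitis data would have to be checked by rerunning the comparison.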

In [144]:
from sklearn.feature_selection import SelectKBest

metric_scores = {}
k = 11
while k:
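    # X_select is computed here but not used below: cross_validate still
    # receives every column of X_train_smote, so each row scores the same features.
    X_select = SelectKBest(k=k).fit_transform(X_train_smote, y_train_smote)
    scores = cross_validate(GaussianNB(), X_train_smote, y_train_smote, return_train_score=True)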
    for key, val in scores.items():
        scores[key] = val.mean()
    metric_scores[f"{k} features"] = scores
    k -= 1

pd.DataFrame(metric_scores).T
Out[144]:
fit_time score_time test_score train_score
11 features 0.002639 0.000763 0.958 0.9615
10 features 0.002233 0.000985 0.958 0.9615
9 features 0.003296 0.000997 0.958 0.9615
8 features 0.003479 0.001119 0.958 0.9615
7 features 0.003272 0.000997 0.958 0.9615
6 features 0.003287 0.001009 0.958 0.9615
5 features 0.003494 0.001123 0.958 0.9615
4 features 0.002597 0.000997 0.958 0.9615
3 features 0.002957 0.001036 0.958 0.9615
2 features 0.003306 0.001098 0.958 0.9615
1 features 0.003008 0.001400 0.958 0.9615
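
The test and train scores are identical in every row, which is what one would expect when the classifier receives the full feature matrix regardless of `k`. A sketch of one way to make `k` take effect is to put `SelectKBest` and the classifier in a single pipeline, so the selection is refit on each training fold. The data below is synthetic, and the scores it produces say nothing about the hepatitis data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic data: 2 informative features followed by 4 pure-noise features.
n = 300
y_toy = np.repeat([0, 1, 2], n // 3)
informative = rng.normal(0, 1, (n, 2)) + y_toy[:, None] * 2.0
noise = rng.normal(0, 1, (n, 4))
X_toy = np.hstack([informative, noise])

metric_scores = {}
for k in range(X_toy.shape[1], 0, -1):
    # Selection happens inside the pipeline, so it is refit on each training fold
    # and the classifier only sees the k selected columns.
    pipe = make_pipeline(SelectKBest(k=k), GaussianNB())
    scores = cross_validate(pipe, X_toy, y_toy, return_train_score=True)
    metric_scores[f"{k} features"] = {key: val.mean() for key, val in scores.items()}

print({name: round(s["test_score"], 3) for name, s in metric_scores.items()})
```

The same pattern would apply to LinearDiscriminantAnalysis and LogisticRegression by swapping the last pipeline step.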
In [160]:
metric_scores = {}
k = 11
while k:
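    # X_select is computed here but not used below: cross_validate still
    # receives every column of X_train_smote, so each row scores the same features.
    X_select = SelectKBest(k=k).fit_transform(X_train_smote, y_train_smote)
    scores = cross_validate(LinearDiscriminantAnalysis(), X_train_smote, y_train_smote, return_train_score=True)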
    for key, val in scores.items():
        scores[key] = val.mean()
    metric_scores[f"{k} features"] = scores
    k -= 1

pd.DataFrame(metric_scores).T
Out[160]:
fit_time score_time test_score train_score
11 features 0.006229 0.000916 0.952 0.952875
10 features 0.110903 0.001026 0.952 0.952875
9 features 0.005572 0.001362 0.952 0.952875
8 features 0.005397 0.001005 0.952 0.952875
7 features 0.004206 0.001099 0.952 0.952875
6 features 0.005830 0.001047 0.952 0.952875
5 features 0.006309 0.001005 0.952 0.952875
4 features 0.005009 0.001015 0.952 0.952875
3 features 0.005154 0.000286 0.952 0.952875
2 features 0.004716 0.000999 0.952 0.952875
1 features 0.005148 0.000202 0.952 0.952875
In [162]:
metric_scores = {}
k = 11
while k:
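    # X_select is computed here but not used below: cross_validate still
    # receives every column of X_train_smote, so each row scores the same features.
    X_select = SelectKBest(k=k).fit_transform(X_train_smote, y_train_smote)
    scores = cross_validate(LogisticRegression(), X_train_smote, y_train_smote, return_train_score=True)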
    for key, val in scores.items():
        scores[key] = val.mean()
    metric_scores[f"{k} features"] = scores
    k -= 1

pd.DataFrame(metric_scores).T
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[162]:
             fit_time  score_time  test_score  train_score
11 features  0.036723    0.000599      0.9665     0.965625
10 features  0.039044    0.000825      0.9665     0.965625
9 features   0.038713    0.000399      0.9665     0.965625
8 features   0.049767    0.001038      0.9665     0.965625
7 features   0.043265    0.001748      0.9665     0.965625
6 features   0.056151    0.000899      0.9665     0.965625
5 features   0.051352    0.001102      0.9665     0.965625
4 features   0.056642    0.001199      0.9665     0.965625
3 features   0.043280    0.001024      0.9665     0.965625
2 features   0.047585    0.001111      0.9665     0.965625
1 features   0.042775    0.000399      0.9665     0.965625

For all three models, 2 features perform just as well as 11 features. The LogisticRegression model scores best on this dataset, so from here on we continue with LogisticRegression only. The next step is to determine which features are most suitable for the prediction.
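The ConvergenceWarning messages in the output above suggest either scaling the data or raising `max_iter`. The sketch below shows one way this could be done while producing the same four score columns as the table. It is only an illustration under assumptions: the data is synthetic (`make_classification`), the names `X_demo`, `y_demo` and `scores_demo` are made up for this example, and taking the first `n` columns is an arbitrary choice rather than a feature ranking.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: this is NOT the hepatitis dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=4, random_state=0
)

# Scaling inside the pipeline means the scaler is fitted on each training fold only
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

rows = {}
for n_features in range(11, 0, -1):
    # Arbitrary subset: simply the first n_features columns
    cv = cross_validate(
        clf, X_demo[:, :n_features], y_demo, cv=5, return_train_score=True
    )
    # Average every metric over the 5 folds
    rows[f"{n_features} features"] = {key: np.mean(values) for key, values in cv.items()}

scores_demo = pd.DataFrame.from_dict(rows, orient="index")
print(scores_demo)
```

Putting the scaler in the pipeline, rather than scaling the full dataset beforehand, keeps information from the validation folds out of the training step. Whether this changes the scores on the hepatitis data would have to be checked by rerunning the notebook.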

In [184]:
from sklearn.feature_selection import RFE

# Initialize the estimator that RFE uses to rank the features
estimator = LogisticRegression()

# Initialize RFE to keep the 2 highest-ranked features (adjust as needed)
rfe = RFE(estimator, n_features_to_select=2)

# Fit RFE on the SMOTE-resampled training data
rfe.fit(X_train_smote, y_train_smote)

# Column indices of the selected features
selected_features = np.where(rfe.support_)[0].tolist()

selected_features
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[184]:
[4, 5]

Columns 4 and 5 are therefore the most suitable for predicting which form of hepatitis a patient has. These are ALP and ALT. Time to fit the model.

In [255]:
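# Wrap the resampled training features in a DataFrame so columns can be selected by position.
X_train_smote_df = pd.DataFrame(X_train_smote)

y = y_train_smote
# Keep only the two selected feature columns (positions 4 and 5).
X = X_train_smote_df.iloc[:, 4:6]

model = LogisticRegression()
model.fit(X, y)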
Out[255]:
LogisticRegression()

Because I am genuinely curious whether the confusion matrix changes when all features are included, I train a second model that uses all features instead of only two.

In [259]:
y = y_train_smote
X_two = X_train_smote_df

model_two = LogisticRegression()
model_two.fit(X_two, y)
C:\Users\DemiS\Documents\School\Schooljaar_2023_2024\2.3.2_machine_learning\Casus_hepatitis\venv\Lib\site-packages\sklearn\linear_model\_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Out[259]:
LogisticRegression()
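
The ConvergenceWarning above names two remedies itself: scale the data or raise `max_iter`. Below is a minimal sketch of both, assuming a `StandardScaler` in front of the classifier is acceptable for these lab values. The synthetic `X_demo` and `y_demo` are placeholders for illustration, not the notebook's `X_train_smote` and `y_train_smote`, and `max_iter=1000` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic four-class data (assumption: stands in for the resampled training set).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=6,
    n_classes=4, random_state=0,
)
# Give the columns very different scales, as lab values in different units would have.
X_demo = X_demo * np.logspace(0, 3, X_demo.shape[1])

# Standardize every feature first, then fit with a higher iteration limit.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

# n_iter_ holds the number of iterations lbfgs actually needed.
iterations_used = scaled_model[-1].n_iter_.max()
print(f"lbfgs iterations used: {iterations_used}")
```

Putting the scaler inside the pipeline means the same scaling is applied automatically whenever the model predicts on new data.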

Both models are fitted, although lbfgs hit its iteration limit on the second one (see the ConvergenceWarning above). On to the final steps: predict, confusion matrix, and ROC curve.

In [256]:
prediction = model.predict(X)
In [258]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Rows are the true classes, columns the predicted classes.
conf_mat = confusion_matrix(y, prediction)

# model.classes_ is sorted the same way as the rows and columns of conf_mat.
disp = ConfusionMatrixDisplay(conf_mat, display_labels=model.classes_)
disp.plot(cmap='Blues', include_values=True, xticks_rotation='vertical', values_format='d');
[Figure: confusion matrix of the first model (two features)]

This confusion matrix shows that the model is best at predicting the classes '0=Blood Donor' and '3=Cirrhosis'. For class 0, 492 of the 500 samples are classified correctly and only 8 are misclassified: 7 as '1=Hepatitis' and 1 as '2=Fibrosis'. For '3=Cirrhosis', 471 are classified correctly; 29 of the misclassified samples end up in '2=Fibrosis'. Classes 1 and 2 do somewhat worse, with 374 and 392 correctly classified patients respectively.
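
To make these counts easier to compare, they can be turned into per-class recall and overall accuracy. This is a small worked sketch using only the numbers reported above; it assumes SMOTE balanced every class to 500 samples, which the text states explicitly only for class 0.

```python
# Correctly classified samples per class, as reported for the two-feature model.
correct = {
    "0=Blood Donor": 492,
    "1=Hepatitis": 374,
    "2=Fibrosis": 392,
    "3=Cirrhosis": 471,
}
# Assumption: after SMOTE every class contains 500 samples.
samples_per_class = 500

# Recall per class = correctly classified / all samples of that class.
recall = {label: n / samples_per_class for label, n in correct.items()}
# Accuracy = all correctly classified samples / all samples.
accuracy = sum(correct.values()) / (samples_per_class * len(correct))

for label, value in recall.items():
    print(f"{label}: recall = {value:.3f}")
print(f"overall accuracy = {accuracy:.4f}")
```

Under that assumption the recall is 0.984 and 0.942 for classes 0 and 3, against 0.748 and 0.784 for classes 1 and 2, and the overall accuracy is 0.8645.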

Now I do exactly the same for the second model.

In [261]:
prediction_two = model_two.predict(X_two)
In [262]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Rows are the true classes, columns the predicted classes.
conf_mat = confusion_matrix(y, prediction_two)

# model_two.classes_ is sorted the same way as the rows and columns of conf_mat.
disp = ConfusionMatrixDisplay(conf_mat, display_labels=model_two.classes_)
disp.plot(cmap='Blues', include_values=True, xticks_rotation='vertical', values_format='d');
[Figure: confusion matrix of the second model (all features)]

This confusion matrix looks quite different from the first one. With all features, the model recognizes classes 2 and 3 perfectly. It also does very well on classes 0 and 1, but it makes a number of errors there. For '0=Blood Donor' it makes somewhat more errors than the first model (24 instead of 8): 14 end up in class 1, 6 in class 2, and 4 in class 3. '1=Hepatitis' also has some misclassifications, but almost 100 fewer than in the first model. Using all features therefore appears to work better than using only the two features of the first model.

On to the ROC curve.

In [276]:
from sklearn.metrics import roc_curve, roc_auc_score

y_prob = model_two.predict_proba(X_two)
y_true = y_train_smote

scores = {"label": [], "AUC": []}
plt.figure(figsize=(6.4, 6.4))
plt.plot([0, 1], [0, 1], ":k")  # diagonal = random classifier

# predict_proba orders its columns like model_two.classes_, so iterate over that
for index, label in enumerate(model_two.classes_):
    y_label = (y_true == label).astype(int)  # one-vs-rest: 1 for the current category, 0 otherwise
    fpr, tpr, _ = roc_curve(y_label, y_prob[:, index])
    scores["label"].append(label)
    scores["AUC"].append(roc_auc_score(y_label, y_prob[:, index]))
    plt.plot(fpr, tpr, label=label)

plt.axis("square")
plt.grid(True)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC-curve")
plt.legend()
plt.show()
[Figure: ROC-curve of the second model, one curve per category]
In [277]:
pd.DataFrame(scores).set_index("label")
Out[277]:
label          AUC
0=Blood Donor  0.996131
1=Hepatitis    0.996791
2=Fibrosis     0.998600
3=Cirrhosis    0.999859

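I used the second model for the ROC curve because it performs better than the first model, which uses fewer features. This model classifies all four categories almost perfectly (AUC > 0.99). Note that these scores are computed on the same SMOTE-resampled training data the model was fitted on, so they are likely optimistic.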
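
Both the confusion matrices and the AUC values above are computed on the data the models were fitted on (`X`, `X_two` and `y_train_smote`). A minimal sketch of the same evaluation on held-out data follows. The data here is synthetic and every name (`X_demo`, `X_holdout`, `clf`, and so on) is hypothetical, not taken from the notebook. Resampling with SMOTE is left out; in a real run it would be applied to the fitting part only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic four-class data (assumption: stands in for the hepatitis features).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=6,
    n_classes=4, random_state=0,
)

# Hold out 25% of the samples; the model never sees them during fitting.
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_fit, y_fit)

# Confusion matrix and macro-averaged one-vs-rest AUC on the held-out part only.
holdout_matrix = confusion_matrix(y_holdout, clf.predict(X_holdout))
holdout_auc = roc_auc_score(
    y_holdout, clf.predict_proba(X_holdout), multi_class="ovr"
)

print(holdout_matrix)
print(f"macro one-vs-rest AUC on held-out data: {holdout_auc:.3f}")
```

If the held-out scores come out clearly lower than the training scores, that points to overfitting rather than to a model that really separates the four categories this well.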